Covid19 Japanが独自に収集している陽性者単位のデータ(個票データ)。ソースとデータは全てGitHubにて公開されており、データはJSON形式。「レコード数 \(\neq\) 累計陽性者数」であることに注意。
Covid19 JapanがGitHubで公開しているデータは前述のようにJSON形式であり、最新データはlatest.jsonファイルで示されている。このため、読み込む際はひと工夫必要。
陽性者単位の個票データ。
path <- "https://raw.githubusercontent.com/reustle/covid19japan-data/master/docs/patient_data/"
df <- path %>%
paste0("latest.json") %>%
readr::read_lines() %>%
paste0(path, .) %>%
jsonlite::fromJSON()
df
死亡者数や重症者数などの推移データはsummaryフォルダ内のJSON形式ファイルにまとめられている。summaryフォルダの他にsummary_minフォルダというフォルダがあるが、summary_minフォルダ内のJSONファイルは単に改行を省略して小さくしたファイル。
path <- "https://raw.githubusercontent.com/reustle/covid19japan-data/master/docs/summary/"
df_s <- path %>%
paste0("latest.json") %>%
readr::read_lines() %>%
paste0(path, .) %>%
jsonlite::fromJSON()
df_s %>% summary()
## Length Class Mode
## prefectures 27 data.frame list
## regions 12 data.frame list
## daily 37 data.frame list
## updated 1 -none- character
要約すると分かるように3つのデータフレーム(都道府県単位、八地方区分単位、日次)と一つのベクトル(更新日時)から構成されている。
更新日次時点における都道府県単位での累積値。陽性者・死亡者などの時系列集計データはネストで格納されている。
厚生労働省のオープンデータが集計から除いている空港検疫・ダイヤモンドプリンセス・長崎クルーズ船・その他を含めて全51区分。
df_s$prefectures
更新日次時点における八地方区分単位での累積値。陽性者・死亡者などの時系列集計データは都道府県単位と同様にネストで格納されている。
ただし、確認した時点(2020/11/3)では、時系列集計値の合計と累積値が一致しない。
df_s$regions
個票データを日次で集計したもの。累積値の他に移動平均も含まれているが、暗黙の欠落を含んだデータである点に注意が必要。
df_s$daily
の更新日時が記録されている。
df_s$updated
## [1] "2020-11-03T12:23:54+09:00"
最初にデータがどのようになっているか確認する。これには要約に便利なskimrパッケージを用いる。
df %>%
skimr::skim()
| Name | Piped data |
| Number of rows | 104091 |
| Number of columns | 23 |
| _______________________ | |
| Column type frequency: | |
| character | 19 |
| logical | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| patientId | 0 | 1.00 | 1 | 8 | 0 | 102432 | 0 |
| dateAnnounced | 0 | 1.00 | 10 | 10 | 0 | 280 | 0 |
| gender | 14184 | 0.86 | 1 | 1 | 0 | 2 | 0 |
| detectedPrefecture | 0 | 1.00 | 3 | 15 | 0 | 49 | 0 |
| patientStatus | 100135 | 0.04 | 8 | 23 | 0 | 8 | 0 |
| notes | 53917 | 0.48 | 1 | 270 | 0 | 47412 | 1 |
| mhlwPatientNumber | 103642 | 0.00 | 1 | 11 | 0 | 434 | 0 |
| prefecturePatientNumber | 12022 | 0.88 | 5 | 20 | 0 | 92060 | 0 |
| prefectureSourceURL | 72787 | 0.30 | 5 | 224 | 0 | 3439 | 0 |
| residence | 22026 | 0.79 | 1 | 38 | 0 | 1422 | 0 |
| sourceURL | 637 | 0.99 | 1 | 239 | 0 | 7941 | 0 |
| relatedPatients | 93695 | 0.10 | 2 | 259 | 0 | 6345 | 0 |
| knownCluster | 101609 | 0.02 | 3 | 88 | 0 | 229 | 0 |
| detectedCityTown | 78092 | 0.25 | 2 | 22 | 0 | 663 | 0 |
| cityPrefectureNumber | 78357 | 0.25 | 1 | 34 | 0 | 25725 | 2 |
| citySourceURL | 92259 | 0.11 | 9 | 317 | 0 | 3637 | 0 |
| deceasedDate | 102294 | 0.02 | 10 | 10 | 0 | 229 | 0 |
| deceasedReportedDate | 102877 | 0.01 | 10 | 62 | 0 | 204 | 0 |
| deathSourceURL | 103021 | 0.01 | 14 | 123 | 0 | 651 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| confirmedPatient | 0 | 1 | 0.98 | TRU: 102431, FAL: 1660 |
| charterFlightPassenger | 104077 | 0 | 1.00 | TRU: 14 |
| cruisePassengerDisembarked | 104080 | 0 | 1.00 | TRU: 11 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ageBracket | 0 | 1 | 32.65 | 23.57 | -1 | 20 | 30 | 50 | 100 | ▅▇▅▂▁ |
元がJSON形式なので、読み込んだ直後は殆どの変量(フィーチャー)が文字型になっていることが分かる。また、意外と欠損が多いことも分かる。
各変量(フィーチャー)を適切な形式に変換し、地域区分でも分析できるように都道府県データと結合する。
x <- df %>%
dplyr::select(patientId, date = dateAnnounced, gender,
pref = detectedPrefecture, patientStatus, knownCluster,
confirmedPatient, charterFlightPassenger,
cruisePassengerDisembarked, ageBracket,
deceasedDate, deceasedReportedDate) %>%
dplyr::filter(confirmedPatient == TRUE) %>%
dplyr::mutate(date = lubridate::as_date(date),
gender = forcats::as_factor(gender),
patientStatus = forcats::as_factor(patientStatus),
cluster = dplyr::if_else(!is.na(knownCluster), TRUE, FALSE),
ageBracket = forcats::as_factor(ageBracket),
deceasedDate = lubridate::as_date(deceasedDate),
deceasedReportedDate = lubridate::as_date(deceasedReportedDate)) %>%
dplyr::left_join(prefs, by = c("pref" = "pref")) %>%
dplyr::select(-`推計人口`) %>%
dplyr::rename(Pref = `都道府県`, region = `八地方区分`)
## Warning: Problem with `mutate()` input `deceasedReportedDate`.
## ℹ 2 failed to parse.
## ℹ Input `deceasedReportedDate` is `lubridate::as_date(deceasedReportedDate)`.
## Warning: 2 failed to parse.
x
変換結果を要約してみると
x %>%
skimr::skim()
| Name | Piped data |
| Number of rows | 102431 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| Date | 3 |
| factor | 9 |
| logical | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| patientId | 0 | 1.00 | 1 | 8 | 0 | 102431 | 0 |
| pref | 0 | 1.00 | 3 | 15 | 0 | 49 | 0 |
| knownCluster | 99978 | 0.02 | 3 | 88 | 0 | 227 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2020-01-15 | 2020-11-03 | 2020-08-12 | 280 |
| deceasedDate | 102052 | 0 | 2020-02-13 | 2020-10-17 | 2020-05-08 | 150 |
| deceasedReportedDate | 102101 | 0 | 2020-02-13 | 2020-10-17 | 2020-05-16 | 131 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 13547 | 0.87 | FALSE | 2 | M: 49916, F: 38968 |
| patientStatus | 99898 | 0.02 | FALSE | 8 | Hos: 1261, Dec: 371, Hom: 315, Dis: 283 |
| ageBracket | 0 | 1.00 | FALSE | 13 | 20: 24465, 30: 15417, -1: 13646, 40: 12751 |
| pcode | 1203 | 0.99 | FALSE | 47 | 13: 31323, 27: 12950, 14: 8812, 23: 6326 |
| Pref | 1203 | 0.99 | FALSE | 47 | 東京都: 31323, 大阪府: 12950, 神奈川: 8812, 愛知県: 6326 |
| region | 1203 | 0.99 | FALSE | 8 | 関東地: 53296, 近畿地: 20358, 九州地: 10913, 中部地: 9919 |
| 広域圏 | 7869 | 0.92 | FALSE | 8 | 首都圏: 53514, 近畿圏: 19791, 中部圏: 8592, 九州圏: 7537 |
| 通俗的区分 | 1203 | 0.99 | FALSE | 11 | 関東: 53296, 関西: 19791, 東海: 8249, 九州: 7537 |
| fct_pref | 1203 | 0.99 | FALSE | 47 | Tok: 31323, Osa: 12950, Kan: 8812, Aic: 6326 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| confirmedPatient | 0 | 1 | 1.00 | TRU: 102431 |
| charterFlightPassenger | 102417 | 0 | 1.00 | TRU: 14 |
| cruisePassengerDisembarked | 102420 | 0 | 1.00 | TRU: 11 |
| cluster | 0 | 1 | 0.02 | FAL: 99978, TRU: 2453 |
文字型を因子型に変換するだけでも大まかな傾向が見えるようになる。例えば
ことが読める。
patientStatusは以下の通りで、ほぼ更新されていないのと思われる。死者数などの推移を見る場合はサマリデータを使った方がいい。
x %>%
dplyr::group_by(patientStatus) %>%
dplyr::summarise(n = n()) %>%
dplyr::ungroup() %>%
dplyr::mutate(Japanese = c("回復", "入院中", "退院済", "死亡", "詳細不明",
"重症", "自宅療養", "ホテル療養", NA))
都道府県別の総陽性者数と人口千人あたりの陽性者率を求める。
x %>%
dplyr::group_by(Pref) %>%
dplyr::summarise(n = n()) %>%
dplyr::left_join(prefs, by = c("Pref" = "都道府県")) %>%
dplyr::select(Pref, n, population = `推計人口`) %>%
dplyr::mutate(rate = round(n / population, 2))
上位10県を累計人数と人口千人あたりの陽性者数で比べてみる。
x %>%
dplyr::group_by(Pref) %>%
dplyr::summarise(n = n()) %>%
dplyr::left_join(prefs, by = c("Pref" = "都道府県")) %>%
dplyr::select(Pref, n, population = `推計人口`) %>%
dplyr::mutate(rate = round(n / population, 2)) %>%
dplyr::slice_max(order_by = n, n = 10) %>%
dplyr::rename(`累計陽性者数` = n, `推計人口[千人]` = population, `率` = rate)
x %>%
dplyr::group_by(Pref) %>%
dplyr::summarise(n = n()) %>%
dplyr::left_join(prefs, by = c("Pref" = "都道府県")) %>%
dplyr::select(Pref, n, population = `推計人口`) %>%
dplyr::mutate(rate = round(n / population, 2)) %>%
dplyr::slice_max(order_by = rate, n = 10) %>%
dplyr::rename(`累計陽性者数` = n, `推計人口[千人]` = population, `率` = rate)
累計の陽性者数は、ほぼ、人口に比例しているが、一部の県での感染率が高いことが分かる。
地方区分で比較すると都道府県と同様に人口の多い関東、近畿はともかく、九州、北海道の陽性者率が高いことが分かる。
region <- prefs %>%
dplyr::group_by(`八地方区分`) %>%
dplyr::summarise(population = sum(`推計人口`)) %>%
dplyr::rename(region = `八地方区分`)
x %>%
dplyr::group_by(region) %>%
dplyr::summarise(n = n()) %>%
dplyr::left_join(region, by = c("region" = "region")) %>%
dplyr::select(region, n, population) %>%
dplyr::mutate(rate = round(n / population, 2))
日次の陽性者数、前日比、累計を求める。
x %>%
dplyr::group_by(date) %>%
dplyr::summarise(n = n()) %>%
tidyr::complete(date = seq.Date(from = min(date), to = max(date), by = "day"),
fill = list(n = 0L)) %>%
dplyr::mutate(diff = n - dplyr::lag(n, default = 0L), cum = cumsum(n))
x_prefs <- x %>%
dplyr::group_by(date, Pref) %>%
dplyr::summarise(n = n()) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(names_from = Pref, values_from = n, values_fill = 0L) %>%
tidyr::pivot_longer(cols = -date, names_to = "Pref", values_to = "n")
x_prefs
lagdiff <- function(n) {
n - dplyr::lag(n, default = 0L)
}
x_prefs_diff <- x_prefs %>%
tidyr::pivot_wider(names_from = Pref, values_from = n, values_fill = 0L) %>%
dplyr::mutate_if(is.integer, .funs = lagdiff) %>%
tidyr::pivot_longer(cols = -date, names_to = "Pref", values_to = "diff")
x_prefs_diff
x_prefs_cum <- x_prefs %>%
tidyr::pivot_wider(names_from = Pref, values_from = n, values_fill = 0L) %>%
dplyr::mutate_if(is.integer, .funs = cumsum) %>%
tidyr::pivot_longer(cols = -date, names_to = "Pref", values_to = "cum")
x_prefs_cum
x_by_prefs <- x_prefs %>%
dplyr::left_join(x_prefs_diff, by = c("date" = "date", "Pref" = "Pref")) %>%
dplyr::left_join(x_prefs_cum, by = c("date" = "date", "Pref" = "Pref")) %>%
dplyr::left_join(prefs, ., by = c("都道府県" = "Pref")) %>%
dplyr::mutate(Pref = forcats::fct_inorder(`都道府県`)) %>%
dplyr::select(date, Pref, n, diff, cum) %>%
dplyr::arrange(date)
x_by_prefs
x_region <- x %>%
dplyr::group_by(date, region) %>%
dplyr::summarise(n = n()) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(names_from = region, values_from = n, values_fill = 0L) %>%
tidyr::pivot_longer(cols = -date, names_to = "region", values_to = "n")
x_region
lagdiff <- function(n) {
n - dplyr::lag(n, default = 0L)
}
x_region_diff <- x_region %>%
tidyr::pivot_wider(names_from = region, values_from = n, values_fill = 0L) %>%
dplyr::mutate_if(is.integer, .funs = lagdiff) %>%
tidyr::pivot_longer(cols = -date, names_to = "region", values_to = "diff")
x_region_diff
x_region_cum <- x_region %>%
tidyr::pivot_wider(names_from = region, values_from = n, values_fill = 0L) %>%
dplyr::mutate_if(is.integer, .funs = cumsum) %>%
tidyr::pivot_longer(cols = -date, names_to = "region", values_to = "cum")
x_region_cum
x_by_region <- x_region %>%
dplyr::left_join(x_region_diff, by = c("date" = "date", "region" = "region")) %>%
dplyr::left_join(x_region_cum, by = c("date" = "date", "region" = "region")) %>%
dplyr::left_join(prefs, ., by = c("八地方区分" = "region")) %>%
dplyr::mutate(region = forcats::fct_inorder(`八地方区分`)) %>%
dplyr::select(date, region, n, diff, cum) %>%
dplyr::arrange(date)
x_by_region
x %>%
dplyr::group_by(Pref) %>%
dplyr::summarise(n = n()) %>%
dplyr::left_join(prefs, by = c("Pref" = "都道府県")) %>%
dplyr::select(Pref, n, population = `推計人口`) %>%
dplyr::mutate(rate = round(n / population, 2)) %>%
ggplot2::ggplot(ggplot2::aes(x = population, y = n) ) +
ggplot2::geom_point(ggplot2::aes(colour = Pref)) +
ggrepel::geom_text_repel(ggplot2::aes(label = Pref, colour = Pref)) +
ggplot2::theme(legend.position = 'none') +
ggplot2::labs(title = "", x = "推計人口[千人]", y = "累計陽性者数")
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_text_repel).
x %>%
dplyr::group_by(Pref) %>%
dplyr::summarise(n = n()) %>%
dplyr::left_join(prefs, by = c("Pref" = "都道府県")) %>%
dplyr::select(Pref, n, population = `推計人口`) %>%
dplyr::mutate(rate = round(n / population, 2)) %>%
dplyr::filter(n < 1000) %>%
# dplyr::slice_min(order_by = n, n = 38) %>%
ggplot2::ggplot(ggplot2::aes(x = population, y = n) ) +
ggplot2::geom_point(ggplot2::aes(colour = Pref)) +
ggrepel::geom_text_repel(ggplot2::aes(label = Pref, colour = Pref)) +
ggplot2::theme(legend.position = 'none') +
ggplot2::labs(title = "累計陽性者千人未満", x = "推計人口[千人]", y = "累計陽性者数")
region <- prefs %>%
dplyr::group_by(`八地方区分`) %>%
dplyr::summarise(population = sum(`推計人口`)) %>%
dplyr::rename(region = `八地方区分`)
x %>%
dplyr::group_by(region) %>%
dplyr::summarise(n = n()) %>%
dplyr::left_join(region, by = c("region" = "region")) %>%
dplyr::select(region, n, population) %>%
dplyr::mutate(rate = round(n / population, 2)) %>%
ggplot2::ggplot(ggplot2::aes(x = population, y = n) ) +
ggplot2::geom_point(ggplot2::aes(colour = region)) +
ggrepel::geom_text_repel(ggplot2::aes(label = region, colour = region)) +
ggplot2::theme(legend.position = 'none') +
ggplot2::labs(title = "", x = "推計人口[千人]", y = "累計陽性者数")
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_text_repel).
sec_scale <- 100
ncol <- 5
x_by_prefs %>%
ggplot2::ggplot(ggplot2::aes(x = date)) +
ggplot2::geom_bar(ggplot2::aes(y = n, fill = Pref), stat = "identity",
alpha = 0.25, width = 1.0) +
ggplot2::geom_line(ggplot2::aes(y = cum / sec_scale, colour = Pref)) +
ggplot2::facet_wrap(~ Pref, ncol = ncol) +
ggplot2::theme(legend.position = 'none') +
ggplot2::labs(title = "Fixed scale", x = "", y = "") +
ggplot2::scale_y_continuous(
name = "陽性者(単日)",
sec.axis = ggplot2::sec_axis(~ . * sec_scale,
name = "累積陽性者数(折線)")
)
x_by_prefs %>%
ggplot2::ggplot(ggplot2::aes(x = date)) +
ggplot2::geom_bar(ggplot2::aes(y = n, fill = Pref), stat = "identity",
alpha = 0.25, width = 1.0) +
ggplot2::geom_line(ggplot2::aes(y = cum / sec_scale, colour = Pref)) +
ggplot2::facet_wrap(~ Pref, ncol = ncol, scales = "free_y") +
ggplot2::theme(legend.position = 'none') +
ggplot2::labs(title = "Free Y scale", x = "", y = "") +
ggplot2::scale_y_continuous(
name = "陽性者(単日)",
sec.axis = ggplot2::sec_axis(~ . * sec_scale,
name = "累積陽性者数(折線)")
)
sec_scale <- 10
x_by_region %>%
ggplot2::ggplot(ggplot2::aes(x = date)) +
ggplot2::geom_bar(ggplot2::aes(y = n, fill = region), stat = "identity",
alpha = 0.25, width = 1.0) +
ggplot2::geom_line(ggplot2::aes(y = cum / sec_scale, colour = region)) +
ggplot2::facet_wrap(~ region) +
ggplot2::theme(legend.position = 'none') +
ggplot2::labs(title = "Fixed scale", x = "", y = "") +
ggplot2::scale_y_continuous(
name = "陽性者(単日)",
sec.axis = ggplot2::sec_axis(~ . * sec_scale,
name = "累積陽性者数(折線)")
)
x_by_region %>%
ggplot2::ggplot(ggplot2::aes(x = date)) +
ggplot2::geom_bar(ggplot2::aes(y = n, fill = region), stat = "identity",
alpha = 0.25, width = 1.0) +
ggplot2::geom_line(ggplot2::aes(y = cum / sec_scale, colour = region)) +
ggplot2::facet_wrap(~ region, scales = "free_y") +
ggplot2::theme(legend.position = 'none') +
ggplot2::labs(title = "Free Y scale", x = "", y = "") +
ggplot2::scale_y_continuous(
name = "陽性者(単日)",
sec.axis = ggplot2::sec_axis(~ . * sec_scale,
name = "累積陽性者数(折線)")
)